feat(observability): emit failed_stage and failed_stage_delta_ms on pipeline.failed by chrisleekr · Pull Request #244 · chrisleekr/github-app-playground

chrisleekr · 2026-06-20T22:07:37Z

What changed and why

pipeline.failed previously carried only pipeline_wall_clock_ms + err, giving operators no signal about which stage threw or how long it had been running — they had to scroll back through pipeline.stage lines and mentally reconstruct the static stage order to diagnose a failure.

This adds a StageTracker cursor to runPipeline. Each of the seven timeStage calls and the inline prompt.build site receive it as an optional fourth argument: on stage entry the tracker is set, on success it is cleared, and on throw it is left intact. The outer catch reads stageTracker.active and conditionally spreads failed_stage + failed_stage_delta_ms into the pipeline.failed line. Both fields are omitted when no timed stage was in flight (an early failure before any stage, or after all stages succeeded).

One correctness detail: the trackingComment.finalize inner catch swallows its error and continues, so stageTracker.active is explicitly cleared there — otherwise a subsequent throw would be mis-attributed to finalize.

The change is additive: success path, control flow, and return values are unchanged. timeStage's tracker argument is optional, so every existing caller stays valid. New strict PipelineFailedLogSchema in src/core/log-fields.ts mirrors PipelineCompletedLogSchema, pinning the new fields so a rename or stray field trips the co-located test.

Flow

flowchart TD
  RP["runPipeline creates a StageTracker"]:::new
  STG["timeStage stage fn tracker"]:::keep
  SUCC["on success: tracker cleared<br/>pipeline.completed unchanged"]:::keep
  THROW["on throw: tracker keeps stage and startedAt"]:::new
  FAIL["pipeline.failed<br/>err + pipeline_wall_clock_ms<br/>+ failed_stage + failed_stage_delta_ms"]:::new
  RP --> STG
  STG --> SUCC
  STG --> THROW
  THROW --> FAIL
  classDef new fill:#196f3d,color:#ffffff
  classDef keep fill:#2c3e50,color:#ffffff

Changes

src/core/log-fields.ts — StageTracker interface + createStageTracker(); timeStage gains an optional tracker param (set on entry, clear on success, retain on throw, re-throw unchanged); strict PipelineFailedLogSchema + PipelineFailedLog.
src/core/pipeline.ts — create the tracker, thread it to the 7 timeStage calls + prompt.build, clear it in the finalize inner catch, and spread the two fields in the outer catch.
test/core/log-fields.test.ts — PipelineFailedLogSchema drift suite (valid; camelCase/extra-field/negative/wrong-event-literal/missing-required rejected) + timeStage tracker suite (clear-on-success, retain-on-throw, null-initial, no-tracker backward-compat).
docs/operate/observability.md — pipeline.failed row documents the two new fields.

Test plan

11 new log-fields tests pass; each runPipeline-exercising handler test (review/implement/resolve) passes in isolation
typecheck / lint / format / docs:build clean
Full bun test failure count is baseline-identical (no new failures); the only combined-run failures are pre-existing cross-file mock pollution the isolated CI runner avoids

Closes #226

Summary by CodeRabbit

Documentation
- Updated pipeline observability documentation with expanded failure event schemas and detailed payload field descriptions.
New Features
- Enhanced pipeline failure diagnostics: failures now record which stage was active at time of failure and the elapsed duration, enabling faster root cause analysis and debugging.

…ipeline.failed pipeline.failed carried only err + pipeline_wall_clock_ms, so an operator had to scan back for the last pipeline.stage line and know the static stage order to guess which stage threw. Add a StageTracker threaded as an optional 4th arg through the 7 timeStage calls + the prompt.build site: timeStage sets it on entry, clears on success, and leaves it set on throw, so the outer catch attributes failed_stage + failed_stage_delta_ms. The finalize inner catch clears the tracker so a later throw is not mis-attributed. New strict PipelineFailedLogSchema mirrors PipelineCompletedLogSchema. Additive only. Closes #226 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015v2bspPF1ZTTG9ZZrbBgA1

coderabbitai · 2026-06-20T22:07:50Z

Warning

Review limit reached

@chrisleekr, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 52 minutes and 49 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9327e70b-38f0-416a-82c2-8fd099a99db7

📥 Commits

Reviewing files that changed from the base of the PR and between b91f1c9 and 7760b9c.

📒 Files selected for processing (2)

src/core/log-fields.ts
test/core/log-fields.test.ts

📝 Walkthrough

Walkthrough

Adds failed_stage and failed_stage_delta_ms fields to pipeline.failed log events. Introduces PipelineFailedLogSchema, StageTracker interface, and createStageTracker() in log-fields.ts, extends timeStage with an optional tracker parameter, wires the tracker through all runPipeline timed phases, and updates tests and observability documentation.

Changes

Stage Attribution for Pipeline Failures

Layer / File(s)	Summary
`PipelineFailedLogSchema`, `StageTracker`, and `timeStage` contract `src/core/log-fields.ts`	Adds `PipelineFailedLogSchema`/`PipelineFailedLog` Zod contract with optional `failed_stage`/`failed_stage_delta_ms` fields, introduces `StageTracker` interface and `createStageTracker()` helper, and extends `timeStage` to accept an optional tracker that is set before awaiting `fn` and cleared only on success, leaving it populated for the outer catch on failure.
`runPipeline` stageTracker wiring and failure attribution `src/core/pipeline.ts`	Expands `log-fields` imports, initializes `stageTracker` at `runPipeline` entry, wraps all timed phases (`trackingComment.create`, `token.resolve`, `github.fetch`, `checkoutRepo`, `executor.invoke`, `trackingComment.finalize`, `workspace.cleanup`) with `timeStage(..., stageTracker)`, manually sets/clears `stageTracker.active` around the `prompt.build` phase, clears the tracker on finalization error, and emits `failed_stage`/`failed_stage_delta_ms` from `stageTracker.active` in the outer catch.
Schema tests, `timeStage` tracker tests, and observability docs `test/core/log-fields.test.ts`, `docs/operate/observability.md`	Adds `PipelineFailedLogSchema` validation tests (field acceptance, rejection of negative values, strict mode checks) and `timeStage` tracker behavior tests (success-path clear, failure-path preservation, null initialization, no-tracker compatibility); updates `observability.md` table with `failed_stage`/`failed_stage_delta_ms` fields and schema pinning notes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

chrisleekr/github-app-playground#182: Modifies the same src/core/log-fields.ts and src/core/pipeline.ts files to introduce timeStage-based pipeline.stage timing instrumentation, which this PR extends with tracker-based failure attribution.
chrisleekr/github-app-playground#225: Also modifies runPipeline in src/core/pipeline.ts around the same retry-wrapped pipeline phases where this PR adds stageTracker attribution.

Suggested labels

type: feature ✨, type: docs 📋

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title directly describes the primary feature being added: emitting failed_stage and failed_stage_delta_ms fields on pipeline.failed logs for observability.
Linked Issues check	✅ Passed	The implementation fully addresses all objectives from issue `#226`: structured error-attribution via failed_stage and failed_stage_delta_ms, RED-method gap bridging, cardinality-safe analytics, minimal bounded scope with StageTracker, PipelineFailedLogSchema validation, and backward compatibility preservation.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with issue `#226` objectives: log schema additions, StageTracker interface/function, timeStage signature update, runPipeline integration, tests, and documentation—no unrelated modifications present.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

This PR improves failure-path observability for the core execution pipeline by emitting stage attribution on the terminal pipeline.failed log line, so operators can immediately see which stage threw and how long it was running before failing.

Changes:

Introduces a StageTracker and threads it through runPipeline + timeStage to capture the currently active stage at the moment of failure.
Adds failed_stage and failed_stage_delta_ms to pipeline.failed (when applicable) and documents the new fields.
Adds a strict PipelineFailedLogSchema and corresponding tests to guard against future log-field drift.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File	Description
`src/core/log-fields.ts`	Adds `StageTracker` + `createStageTracker()`, extends `timeStage()` with optional tracking, and introduces `PipelineFailedLogSchema`.
`src/core/pipeline.ts`	Creates and threads a stage tracker through stages; emits `failed_stage` + `failed_stage_delta_ms` on `pipeline.failed`.
`test/core/log-fields.test.ts`	Adds schema drift tests for `pipeline.failed` and behavior tests for `timeStage` tracking behavior.
`docs/operate/observability.md`	Documents the new `pipeline.failed` fields.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/core/log-fields.ts`:
- Around line 74-81: The PipelineFailedLogSchema object currently allows
failed_stage and failed_stage_delta_ms to be independently optional, but they
must always appear together or not at all according to the documented contract.
Add a validation constraint to the PipelineFailedLogSchema using Zod's refine
method to enforce that either both fields are present (defined) or both are
absent (undefined), rejecting any state where only one field is provided.

In `@src/core/pipeline.ts`:
- Around line 592-595: In the catch block for the cleanup error (the block
starting at line 593 that catches cleanupError), after logging the error
message, clear the stageTracker.active field similar to the pattern already used
in the trackingComment.finalize catch block around lines 510-513. This ensures
that when cleanup fails and the error is suppressed, the stageTracker state is
reset so the outer catch handler doesn't misattribute the original failure to
the "workspace.cleanup" stage.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9f4cdf94-e07f-4578-9bee-d568ca675527

📥 Commits

Reviewing files that changed from the base of the PR and between e1e7f9e and b91f1c9.

📒 Files selected for processing (4)

docs/operate/observability.md
src/core/log-fields.ts
src/core/pipeline.ts
test/core/log-fields.test.ts

…Schema Address review: failed_stage and failed_stage_delta_ms are emitted together or not at all, so a .refine rejects a record carrying only one. Add paired- contract regression tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015v2bspPF1ZTTG9ZZrbBgA1

…trict schemas Bundles 12 area:observability issues into one change. Each family ships a Zod-strict *-log-fields.ts module with a co-located drift-prevention test, following the #242/#244 pattern: structured_output, circuit, digest, github.api.slow, github.app.token.mint, http, scheduler.scan, workflow.run, workspace, agent.tool, daemon.connection, k8s.spawn. Behavioral changes beyond logging: #236 routes 6 mint sites through a new mintInstallationToken helper (exact cache_hit via scoped hook.before); #218 removes unused onConnected/onDisconnected WsClientOptions callbacks; #223 replaces octokit hook.after/hook.error with one hook.wrap timing closure (new GITHUB_API_SLOW_REQUEST_MS env, default 3000). Security: errSerializer now strips octokit event/payload/signature carriers so the raw webhook body + HMAC signature cannot leak through err: log lines (regression-tested). Docs: 12 observability.md sections + alerts, configuration.md env var, daemon-fleet.md k8s.spawn diagnostics. Closes #216 #217 #218 #223 #228 #233 #234 #235 #236 #237 #243 #247 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Wjb51MuzTGGiJ2ggGX2DhH

Copilot AI review requested due to automatic review settings June 20, 2026 22:07

github-actions Bot added the type: feature ✨ label Jun 20, 2026

Copilot started reviewing on behalf of chrisleekr June 20, 2026 22:08 View session

coderabbitai Bot added the type: docs 📋 label Jun 20, 2026

Copilot AI reviewed Jun 20, 2026

View reviewed changes

Comment thread src/core/log-fields.ts Outdated

Comment thread test/core/log-fields.test.ts

Comment thread test/core/log-fields.test.ts

Comment thread test/core/log-fields.test.ts

Comment thread test/core/log-fields.test.ts

coderabbitai Bot reviewed Jun 20, 2026

View reviewed changes

Comment thread src/core/log-fields.ts Outdated

Comment thread src/core/pipeline.ts

chrisleekr merged commit 4f2483c into main Jun 20, 2026
15 checks passed

chrisleekr deleted the fix/issue-226 branch June 20, 2026 22:18

chrisleekr mentioned this pull request Jun 25, 2026

feat(observability): add 12 structured Pino event families with Zod-strict schemas #251

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(observability): emit failed_stage and failed_stage_delta_ms on pipeline.failed#244

feat(observability): emit failed_stage and failed_stage_delta_ms on pipeline.failed#244
chrisleekr merged 2 commits into
mainfrom
fix/issue-226

chrisleekr commented Jun 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 20, 2026 •

edited

Loading

Review limit reached

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

chrisleekr commented Jun 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed and why

Flow

Changes

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

chrisleekr commented Jun 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 20, 2026 •

edited

Loading